An Empirical Study on Rule Granularity and Unification Interleaving Toward an Efficient Unification-Based Parsing System
Abstract
This paper describes an empirical study on the optimal granularity of the phrase structure rules and the optimal strategy for interleaving CFG parsing with unification in order to implement an efficient unification-based parsing system. We claim that using "medium-grained" CFG phrase structure rules, which balance the computational cost of CFG parsing and unification, is a cost-effective solution for making a unification-based grammar both efficient and easy to maintain. We also claim that "late unification", which delays unification until a complete CFG parse is found, saves unnecessary copies of DAGs for irrelevant subparses and improves performance significantly. The effectiveness of these methods was proved in an extensive experiment. The results show that, on average, the proposed system parses 3.5 times faster than our previous one. The grammar and the parser described in this paper are fully implemented and used as the Japanese analysis module in SL-TRANS, the speech-to-speech translation system of ATR.

1 Introduction

The unification-based framework has been an area of active research in natural language processing. Unification, which is the primary operation of this framework, provides a kind of constraint-checking mechanism for merging various information sources, such as syntax, semantics, and pragmatics. The computational inefficiency of unification, however, precludes the development of large practical NLP systems, although the framework has many attractive theoretical properties.

The efforts made to improve the efficiency of a unification-based parsing system can be classified into four categories.

• CFG parsing algorithm
• Graph unification algorithm
• Grammar representation and organization
• Interaction between CFG parsing and unification

There have been well-known efficient CFG parsing algorithms such as CKY [Aho and Ullman, 77], Earley [Earley, 70], CHART [Kay, 80], and LR [Aho and Ullman, 77] [Tomita, 86].
There have also been several recent in-depth studies into efficient graph unification algorithms, whose main concerns have been either avoiding irrelevant copies of DAGs [Karttunen and Kay, 85] [Pereira, 85] [Karttunen, 86] [Wroblewski, 87] [Godden, 90] [Kogure, 90] [Tomabechi, 91] [Emele, 91], or avoiding the exhaustive expansion of disjunctions into their disjunctive normal forms [Kasper, 87] [Eisele and Dörre, 88] [Maxwell and Kaplan, 89] [Dörre and Eisele, 90] ickier, 901 [Nat ...... 91].

There has, however, been little discussion regarding the optimal representation of a grammar, or linguistic knowledge, in the unification-based framework from the engineering point of view. Grammar organization is highly flexible, as the unification-based framework uses two different forms of knowledge representation: atomic phrase structure rules and feature structure descriptions. The choice between them greatly affects both the computational efficiency and the maintenance cost of the system. There has also been little discussion regarding the optimal interaction between the CFG parsing process and the unification process in unification-based parsing, which also greatly affects overall performance.

Here we introduce the notion of granularity, and suggest medium-grained phrase structure rules, in which morpho-syntactic specifications in the feature descriptions are expanded into phrase structure rules. We claim that they reduce the computational load of unification without intractably increasing the number of rules, and that this granularity is optimal in the sense that it satisfies both efficiency and maintainability. We also suggest late unification as another solution to the copying problem, as it avoids unnecessary copies of irrelevant subparses by delaying unification until a complete CFG parse is found.
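The difference between interleaved ("early") and late unification can be sketched on a toy example. This is a minimal illustration under several assumptions, not the parser described in this paper: the lexicon, the rules, the useless category `X`, and all function names are hypothetical, feature structures are flat dictionaries rather than DAGs, and the number of `unify` calls stands in for the amount of DAG copying.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> (atomic category, flat feature structure).
LEXICON = {
    "dogs": ("NP", {"num": "pl"}),
    "chase": ("V", {"num": "pl"}),
    "cats": ("NP", {}),
}
# Hypothetical binary rules: (left, right) -> parent. "X" only builds a
# subparse that never takes part in a complete parse.
RULES = {("NP", "VP"): "S", ("V", "NP"): "VP", ("NP", "V"): "X"}


def unify(a, b, counter):
    """Unify two flat feature structures; return None on a value clash."""
    counter[0] += 1
    merged = dict(a)
    for key, value in b.items():
        if merged.setdefault(key, value) != value:
            return None
    return merged


def cky(words, leaf, combine):
    """CKY over atomic categories; `combine` decides what an edge stores."""
    n = len(words)
    chart = defaultdict(list)
    for i, word in enumerate(words):
        chart[i, i + 1].append(leaf(word))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            for j in range(i + 1, i + width):
                for left in chart[i, j]:
                    for right in chart[j, i + width]:
                        parent = RULES.get((left[0], right[0]))
                        if parent is None:
                            continue
                        edge = combine(parent, left, right)
                        if edge is not None:
                            chart[i, i + width].append(edge)
    return [edge for edge in chart[0, n] if edge[0] == "S"]


def early_unification(words):
    """Unify whenever a chart edge is built, including useless edges."""
    counter = [0]

    def combine(parent, left, right):
        feats = unify(left[1], right[1], counter)
        return None if feats is None else (parent, feats)

    parses = cky(words, lambda w: LEXICON[w], combine)
    return [feats for _, feats in parses], counter[0]


def late_unification(words):
    """Build CFG trees first, then unify only inside complete S parses."""
    counter = [0]

    def evaluate(edge):
        _, left, right = edge
        if right is None:  # lexical edge: `left` holds the word itself
            return LEXICON[left][1]
        lf, rf = evaluate(left), evaluate(right)
        if lf is None or rf is None:
            return None
        return unify(lf, rf, counter)

    trees = cky(words, lambda w: (LEXICON[w][0], w, None),
                lambda parent, left, right: (parent, left, right))
    results = [evaluate(tree) for tree in trees]
    return [feats for feats in results if feats is not None], counter[0]


sentence = "dogs chase cats".split()
print(early_unification(sentence))  # three unify calls, one for the X edge
print(late_unification(sentence))   # two unify calls, same result
```

In this toy setting both strategies return the same feature structure, but the early strategy also unifies the `X` edge over "dogs chase", which no complete parse uses.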
In the following sections, the design and implementation of the medium-grained phrase structure rules are explained, then the implementation of late unification is illustrated, and finally the effectiveness of the proposed methods is proven in experiments.

Actes de COLING-92, Nantes, 23-28 août 1992 177 Proc. of COLING-92, Nantes, Aug. 23-28, 1992

| Granularity of Phrase Structure Rules | Constraints in Phrase Structure Rules | Constraints in Feature Descriptions | Number of Phrase Structure Rules |
|---|---|---|---|
| Extremely-Coarse-Grained | weak | very strong | 1 ~ 10 |
| Coarse-Grained | medium | strong | 10 ~ 100 |
| Medium-Grained | strong | medium | 100 ~ 1000 |
| Fine-Grained | very strong | weak | 1000 |

Table 1: Granularity of phrase structure rules characterized by the number of rules and the strength of linguistic constraints in the phrase structure rules and the feature descriptions

2 The Granularity of Phrase Structure Rules

2.1 Granularity

Phrase structure rule granularity is introduced here to refer to the amount of linguistic constraints specified in the atomic CFG phrase structure rules without annotations. The rule granularity spectrum is classified into four categories, as shown in Table 1, using the number of grammar rules as a measure.

Unification-based grammars, in general, are characterized by a few general annotated phrase structure rules and a lexicon with specific linguistic descriptions. This is especially true for HPSG [Pollard and Sag, 87] and JPSG [Gunji, 87], which are to be categorized as extremely-coarse-grained, as they drastically reduce the number of phrase structure rules to two for English and one for Japanese, respectively. In these frameworks, the only role of the phrase structure rules is to provide a device for combining a head with its complement. Most linguistic constraints are stored in the feature descriptions.
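The classification in Table 1 can be written down as a small lookup on the rule count. This is only a sketch: the function name `granularity` is hypothetical, and because the ranges in the table share their endpoints, assigning a boundary value such as 10 or 100 to the finer class is an assumption made here.

```python
def granularity(num_rules: int) -> str:
    """Map a number of atomic phrase structure rules to a Table 1 class.

    Boundary counts (10, 100, 1000) are assigned to the finer class;
    the table itself does not say which side they belong to.
    """
    if num_rules < 1:
        raise ValueError("a grammar needs at least one phrase structure rule")
    if num_rules < 10:
        return "extremely-coarse-grained"
    if num_rules < 100:
        return "coarse-grained"
    if num_rules < 1000:
        return "medium-grained"
    return "fine-grained"


# HPSG with two rules for English and JPSG with one rule for Japanese
# both fall into the first class.
print(granularity(2), granularity(1))
```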
Coarse-grained rules are characterized as a grammar consisting of atomic phrase structure rules with medium constraints and feature descriptions with strong constraints. Medium-grained rules are characterized as a grammar consisting of atomic phrase structure rules with strong constraints and feature descriptions with medium constraints. Medium-grained rules differ from coarse-grained rules in that they include morpho-syntax in the phrase structure rules, while coarse-grained rules include it in the feature descriptions. This means that medium-grained rules are strong enough to derive syntactic structures from atomic phrase structure rules without feature descriptions.

Grammars for conventional NLP systems using simple or augmented CFG fall into the category of fine-grained rules, which represent most of the linguistic constraints as CFG phrase structure rules; the number of rules usually amounts to an intractable several thousand for practical applications.

2.2 Maintainability and Efficiency

In the unification-based framework, a linguistic constraint can be described either as atomic context-free phrase structure rules or as feature descriptions in annotations and lexical entries. As the number of atomic phrase structure rules decreases, the number of feature descriptions increases. It is true that the lexico-syntactic approach makes the grammar modular and improves its maintainability by reducing the number of rules. However, it must be noted that the computational cost of disjunctive feature structure unification is, in the worst case, exponential in the number of disjunctions [Kasper, 87], whereas the cost of CFG parsing is $O(N^3)$ in the input length N. Therefore, extreme rule reduction results in inefficiency.
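The asymmetry between the two costs can be made concrete with a naive expansion into disjunctive normal form. The sketch below is an illustration of the worst case only: the helper `expand_to_dnf` and the feature names are hypothetical, and practical unifiers try to avoid exactly this expansion.

```python
from itertools import product


def expand_to_dnf(disjunctive_fs):
    """Expand {feature: [alternative values]} into fully specified dicts.

    With k independent binary disjunctions the result has 2**k members.
    """
    names = list(disjunctive_fs)
    choices = product(*(disjunctive_fs[name] for name in names))
    return [dict(zip(names, choice)) for choice in choices]


# Ten hypothetical binary features against a ten-word input.
features = {f"f{i}": ["+", "-"] for i in range(10)}
input_length = 10
print(len(expand_to_dnf(features)), input_length ** 3)  # 1024 1000
```

Each additional binary disjunction doubles the first number, while each additional input word increases the second only polynomially.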
This overwhelms the maintainability benefits of the reduced number of rules, since grammar development is essentially a trial-and-error process and requires a short turn-around time. However, the cost of CFG parsing also increases as the number of rules increases. Therefore, we must choose the granularity so that the reduction in unification cost outweighs the increase in CFG parsing cost, in order to gain overall efficiency.

3 The HPSG-Based Japanese Grammars

In this section, we illustrate the difference between "coarse-grained" rules and "medium-grained" rules using our HPSG-based spoken-style Japanese grammars as an example. We have developed two unification-based grammars with different granularity (see Footnote 1), which are essentially based on HPSG and its application to Japanese (JPSG), for the analysis module [Nagata and Kogure, 90] of an experimental Japanese-to-English speech-to-speech translation system (SL-TRANS) [Morimoto et al., 90]. We have selected the "secretarial service of an international conference registration" as our task domain, in which a conversation between a secretary and a questioner is carried out. The Japanese grammars, however, are not task-specific, but rather general-purpose ones, which cover a wide range of phenomena at many linguistic levels, from syntax and semantics to pragmatics, using typed feature structure descriptions.

Footnote 1: Historically speaking, we first developed coarse-grained rules and then we manually transformed them into medium-grained rules for efficiency.
The linguistic phenomena covered in these grammars include:

• Fundamental Constructions: causative, passive, benefactive, negation, interrogative, etc.,
• Control and Gaps: subject/object control,
• Unbounded Dependencies: topic, relative,
• Word Order Variation and Ellipsis.

3.1 Coarse-Grained Rules vs. Medium-Grained Rules

The coarse-grained HPSG-based Japanese grammar has about 20 generalized phrase structure rules, while the medium-grained grammar has about 200 phrase structure rules. Both grammars use the same lexicon with a vocabulary of about 400 (see Footnote 2).

In the coarse-grained grammar, phrase structure rules only refer to the relative position between the five basic syntactic categories for Japanese: verb (V), noun (N), adverb (ADV), postposition (P), and attributive (ATT). Most of the specific linguistic information is encoded as feature descriptions in either the annotation of the phrase structure rules or the lexical entries. In principle, there is no distinction as to whether a constituent is lexical or phrasal, and there are no subcategories of the five basic categories. This contributes greatly to the reduction in the number of phrase structure rules, which results in better grammar maintainability. We present all the phrase structure rules of the coarse-grained Japanese grammar in Appendix A.

We have noticed that the extensive use of disjunctions in feature descriptions, which results from the reduction of the number of phrase structure rules, is the main cause of inefficiency in the coarse-grained version of the grammar. The three major sources of disjunctions are morpho-syntactic specifications for diverse expressions in the final part of the sentence, free word order and ellipsis of verb complements (subcat slash scrambling), and semantic interpretation of deep case and aspect. The first two are particular problems in spoken-style Japanese.
We have manually converted the coarse-grained phrase structure rules into medium-grained rules to reduce the computational cost of unification. First, we divided each of the basic categories into several subcategories. Then, we divided the coarse-grained phrase structure rules according to the subcategories. To keep the grammar readable, however, we chose to leave the subcat slash scrambling and the semantic interpretation undone, and made extensive efforts to expand the morpho-syntactic specifications.

Footnote 2: We also have another version of the grammar for the same subcorpus, which is used for the continuous speech recognition module [Takezawa et al., 91]. It only uses atomic CFG rules, and the number of rules amounts to more than 2,000. It is, therefore, categorized as a fine-grained grammar in our definition.

3.2 Example: Medium-Grained Rules for Predicate Verb Phrases

In this section, we illustrate the process of transformation using a predicate verb phrase production rule as an example. Japanese predicate phrases consist of a main verb followed by a sequence of auxiliaries and sentence-final particles. There is an almost one-dimensional order of verbal constituents, as shown in Figure 1, which reflects the basic hierarchy of the Japanese sentence structure.

Kernel verbs occur first in a predicate phrase sequence. Voice auxiliaries precede all other auxiliaries, and within this category, the causative auxiliary (sa)seru precedes the passive auxiliary (ra)reru. Aspect auxiliaries, such as the progressive auxiliary (te)iru, precede modal auxiliaries and follow voice auxiliaries. Modal auxiliaries are classified into two groups with respect to the relative order of negative and tense auxiliaries. Mood1 includes the optative auxiliaries, such as tai (want), beki (should/must), etc. Mood2 includes the evidential or inferential auxiliaries, such as rashii (seem/look), kamoshirenai (may), etc.
Negative auxiliaries nai, n (not) follow voice, aspect, and mood1 auxiliaries, and precede tense and mood2 auxiliaries. Tense auxiliaries ta, da (-ed) show irregular behavior. They follow the voice, aspect, mood1, and negative auxiliaries, and precede the mood2 auxiliaries. They can also follow the mood2 auxiliaries. In the coarse-grained grammar, we provide a single phrase structure rule for these phenomena.
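The ordering constraints described in this section can be summarized as a rank check over a sequence of slot labels. This is a sketch of the stated precedence relations only, not the grammar rules themselves: the slot names and the function `is_valid_predicate` are hypothetical, and treating every slot as optional and occurring at most once per position is an assumption.

```python
# Hypothetical slot labels in the order described in the text.
ORDER = ["kernel", "causative", "passive", "aspect",
         "mood1", "negative", "tense", "mood2"]
RANK = {name: position for position, name in enumerate(ORDER)}


def is_valid_predicate(slots):
    """Check a sequence of slot labels against the precedence order.

    A tense auxiliary may appear before mood2 and, additionally, after it;
    the second position is modeled as one extra rank beyond mood2.
    """
    if not slots or slots[0] != "kernel":
        return False
    previous_rank = -1
    seen_mood2 = False
    for slot in slots:
        rank = RANK[slot]
        if slot == "tense" and seen_mood2:
            rank = len(ORDER)  # the tense position that follows mood2
        if rank <= previous_rank:
            return False
        previous_rank = rank
        seen_mood2 = seen_mood2 or slot == "mood2"
    return True


# kernel + causative + passive + aspect + tense + mood2
print(is_valid_predicate(
    ["kernel", "causative", "passive", "aspect", "tense", "mood2"]))  # True
# a negative auxiliary may not follow a tense auxiliary
print(is_valid_predicate(["kernel", "tense", "negative"]))  # False
```

In a coarse-grained grammar, constraints of this kind would sit in feature descriptions attached to a single rule, whereas medium-grained rules encode them in the atomic categories.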